The video for this report is split into four parts, one per section. You can watch the whole video (11:16) from the link in the sidebar, along with the PDF report and the R notebook.

1. Introduction

In this Google Merchandise Store (also known as GStore, where Google swag is sold) data set, we are asked to analyze customer data to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for companies that choose to apply data analysis on top of their Google Analytics (GA) data.

1.1. Goal

What are we predicting?

We are predicting the natural log of the sum of all transactions per user. Once the data is updated, as noted above, this will be for all users in test_v2.csv for December 1st, 2018 to January 31st, 2019. For every user in the test set, the target is:

\[y_{user} = \sum_{i=1}^{n} transaction_{user_i}\]

\[target_{user} = \ln({y_{user}+1})\]
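For illustration, here is a minimal base-R sketch of this target computation, assuming a session-level data frame with columns fullVisitorId and transactionRevenue (toy values, not the competition data):

```r
# Toy session-level data: two users, NA = no transaction in that session
df <- data.frame(
  fullVisitorId      = c("A", "A", "B"),
  transactionRevenue = c(10, 5, NA)
)

# Sum transactions per user, treating NA as zero revenue ...
y <- tapply(df$transactionRevenue, df$fullVisitorId, sum, na.rm = TRUE)

# ... then take the natural log of (sum + 1), i.e. log1p
target <- log1p(y)
target
```

Note the +1 inside the log: most users have zero revenue, and log(0) is undefined, so log1p keeps the target at 0 for them.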

1.2. Data Fields

First, let’s look at what information our data captures. Using the knitr and kableExtra packages, we can generate a nicer HTML table (see the show function below).

# build a knitr table
library(knitr)
library(kableExtra)

show <- function(table, caption = "Data Fields") {
  kable_styling(kable(table, digits = getOption("digits"), caption=caption),
                font_size = 12)
}

name <- c("fullVisitorId", "channelGrouping", "date", "device" , "geoNetwork", "socialEngagementType", "totals", "trafficSource", "visitId", "visitNumber", "visitStartTime", "hits", "customDimensions")

description <- c("A unique identifier for each user of the Google Merchandise Store.","The channel via which the user came to the Store.", "The date on which the user visited the Store.", "The specifications for the device used to access the Store.", "This section contains information about the geography of the user.", "Engagement type, either 'Socially Engaged' or 'Not Socially Engaged'.", "This section contains aggregate values across the session.", "This section contains information about the Traffic Source from which the session originated.", "An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.", "The session number for this user. If this is the first session, then this is set to 1.", "The timestamp (expressed as POSIX time).", "This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.", "This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.")
data.field <- data.frame(name, description)

show(data.field)
Data Fields
name description
fullVisitorId A unique identifier for each user of the Google Merchandise Store.
channelGrouping The channel via which the user came to the Store.
date The date on which the user visited the Store.
device The specifications for the device used to access the Store.
geoNetwork This section contains information about the geography of the user.
socialEngagementType Engagement type, either ‘Socially Engaged’ or ‘Not Socially Engaged’.
totals This section contains aggregate values across the session.
trafficSource This section contains information about the Traffic Source from which the session originated.
visitId An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
visitNumber The session number for this user. If this is the first session, then this is set to 1.
visitStartTime The timestamp (expressed as POSIX time).
hits This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
customDimensions This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.

2. Preprocessing

2.1. Load the data and take a glimpse

Data Fields

One sample row:

channelGrouping: Organic Search
date: 20160902
device: {"browser": "Chrome", "browserVersion": "not available in demo dataset", "browserSize": "not available in demo dataset", "operatingSystem": "Windows", "operatingSystemVersion": "not available in demo dataset", "isMobile": false, "mobileDeviceBranding": "not available in demo dataset", "mobileDeviceModel": "not available in demo dataset", "mobileInputSelector": "not available in demo dataset", "mobileDeviceInfo": "not available in demo dataset", "mobileDeviceMarketingName": "not available in demo dataset", "flashVersion": "not available in demo dataset", "language": "not available in demo dataset", "screenColors": "not available in demo dataset", "screenResolution": "not available in demo dataset", "deviceCategory": "desktop"}
fullVisitorId: 1.13166e+18
geoNetwork: {"continent": "Asia", "subContinent": "Western Asia", "country": "Turkey", "region": "Izmir", "metro": "(not set)", "city": "Izmir", "cityId": "not available in demo dataset", "networkDomain": "ttnet.com.tr", "latitude": "not available in demo dataset", "longitude": "not available in demo dataset", "networkLocation": "not available in demo dataset"}
sessionId: 1131660440785968503_1472830385
socialEngagementType: Not Socially Engaged
totals: {"visits": "1", "hits": "1", "pageviews": "1", "bounces": "1", "newVisits": "1"}
trafficSource: {"campaign": "(not set)", "source": "google", "medium": "organic", "keyword": "(not provided)", "adwordsClickInfo": {"criteriaParameters": "not available in demo dataset"}}
visitId: 1472830385
visitNumber: 1
visitStartTime: 1472830385

Here is one row of our data. Several of the columns (e.g. device, geoNetwork, and trafficSource) contain JSON: each packs multiple pieces of information into a single feature, so we need a way to parse them into separate columns.
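One common way to do this (a sketch, assuming the jsonlite package) is to collapse each JSON column into a single JSON array and parse it in one call with jsonlite::fromJSON:

```r
library(jsonlite)

# Toy stand-in for the device column: one JSON string per session
device <- c('{"browser": "Chrome", "isMobile": false}',
            '{"browser": "Safari", "isMobile": true}')

# Collapse the column into one JSON array, then parse in a single call;
# flatten = TRUE expands nested objects into dot-prefixed columns
# (e.g. adwordsClickInfo.page)
parse_json_col <- function(x) {
  fromJSON(paste0("[", paste(x, collapse = ","), "]"), flatten = TRUE)
}

device_parsed <- parse_json_col(device)
device_parsed$browser   # "Chrome" "Safari"
```

On the real data, the parsed columns would then be bound back onto the table in place of the original JSON column.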

## Observations: 903,653
## Variables: 12
## $ channelGrouping      <fct> Organic Search, Organic Search, Organic Se…
## $ date                 <int> 20160902, 20160902, 20160902, 20160902, 20…
## $ device               <fct> "{\"browser\": \"Chrome\", \"browserVersio…
## $ fullVisitorId        <dbl> 1.131660e+18, 3.773060e+17, 3.895546e+18, …
## $ geoNetwork           <fct> "{\"continent\": \"Asia\", \"subContinent\…
## $ sessionId            <fct> 1131660440785968503_1472830385, 3773060208…
## $ socialEngagementType <fct> Not Socially Engaged, Not Socially Engaged…
## $ totals               <fct> "{\"visits\": \"1\", \"hits\": \"1\", \"pa…
## $ trafficSource        <fct> "{\"campaign\": \"(not set)\", \"source\":…
## $ visitId              <int> 1472830385, 1472880147, 1472865386, 147288…
## $ visitNumber          <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ visitStartTime       <int> 1472830385, 1472880147, 1472865386, 147288…

In addition, some column types are inaccurate (e.g. the date column is stored as an integer), and we need to cast them to the correct form.

2.2. Cast data types

First, the date information here is numeric, so we need to convert it to Date, R’s built-in date class. Likewise, visitStartTime is a POSIX timestamp that we convert to a date-time.
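A minimal sketch of both conversions, using values from the sample row above:

```r
# date is stored as an integer such as 20160902; parse via its string form
d <- as.Date(as.character(20160902), format = "%Y%m%d")
d   # "2016-09-02"

# visitStartTime is POSIX time (seconds since 1970-01-01 UTC)
t <- as.POSIXct(1472830385, origin = "1970-01-01", tz = "UTC")
t   # "2016-09-02 15:33:05 UTC"
```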

2.4. Preprocess missing values

First, we set all unknown values to R’s built-in NA. Some of the newly parsed JSON columns contain placeholder values that can be converted to NA, such as ‘(not set)’ and ‘not available in demo dataset’. Although distinguishing between these values may be useful during modeling, we convert them all to NA for the purposes of visualization.

Several of the columns newly parsed from JSON have only one unique value, e.g. ‘not available in demo dataset’. These columns carry no information, so we drop them here.

All of the columns that were converted from JSON are of class character. Some of them need to be changed to numeric, which we do with data.table operations.
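A base-R sketch of these three cleanup steps on a toy data frame (the report itself uses data.table for the numeric casts; placeholder strings as described above):

```r
df <- data.frame(
  city      = c("Izmir", "(not set)", "New York"),
  visits    = c("1", "1", "1"),    # a single unique value -> dropped
  pageviews = c("1", "3", "2"),    # character counts -> cast to numeric
  stringsAsFactors = FALSE
)

# 1. Replace placeholder strings with NA
na_vals <- c("(not set)", "not available in demo dataset", "(not provided)")
df[] <- lapply(df, function(x) replace(x, x %in% na_vals, NA))

# 2. Drop columns with only one unique non-NA value
df <- df[, sapply(df, function(x) length(unique(na.omit(x))) > 1),
         drop = FALSE]

# 3. Cast count-like columns that are still character to numeric
num_cols <- intersect(c("hits", "pageviews", "bounces", "newVisits",
                        "transactionRevenue"), names(df))
df[num_cols] <- lapply(df[num_cols], as.numeric)
```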

Let’s see our final result.

Data Fields

One row after preprocessing:

channelGrouping: Organic Search
date: 2016-09-02
fullVisitorId: 1.13166e+18
sessionId: 1131660440785968503_1472830385
visitId: 1472830385
visitNumber: 1
visitStartTime: 2016-09-02 15:33:05
browser: Chrome
operatingSystem: Windows
isMobile: FALSE
deviceCategory: desktop
continent: Asia
subContinent: Western Asia
country: Turkey
region: Izmir
metro: NA
city: Izmir
networkDomain: ttnet.com.tr
hits: 1
pageviews: 1
bounces: 1
newVisits: 1
transactionRevenue: NA
campaign: NA
source: google
medium: organic
keyword: NA
referralPath: NA
adContent: NA
adwordsClickInfo.page: NA
adwordsClickInfo.slot: NA
adwordsClickInfo.gclId: NA
adwordsClickInfo.adNetworkType: NA

Now the data is readable for us and analyzable in R.

2.5. Missing data visualization

For better visualization, we convert transactionRevenue back to dollars.
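In the GA export, transactionRevenue is stored in micros, i.e. the dollar amount multiplied by 10^6, so the conversion is a single division (toy values for illustration):

```r
revenue_micros  <- c(NA, 16990000, 39990000)  # raw exported values (assumed)
revenue_dollars <- revenue_micros / 1e6       # back to dollars
revenue_dollars                               # NA 16.99 39.99
```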

3. Data Exploration

What is the time range over which these data were collected?

## [1] "2016-08-01" "2017-08-01"

Target Variable (transaction revenue)

The objective of this competition is to predict total transaction revenue, so let’s take a look at this variable first. What is the range of transaction revenue, i.e. dollars spent per visit?

## [1]     0.01 23129.50

3.1. Statistical distribution of Revenue

Now let’s look at the distribution of revenue from individual visits. Here I use the log of transaction revenue to better display the distribution.

According to the chart, the distribution of each user’s revenue is extremely skewed. We therefore use a quantile-quantile (Q-Q) plot to check normality: a nonlinear pattern suggests the data is not normally distributed. We do not use the Shapiro-Wilk test because our dataset is far too large for it (the test does not accept sample sizes above 5000).

The log transformation is one of the most popular transformations for making skewed data approximately conform to normality, and it is widely used on financial data. Now we try it on our data.
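A sketch of this check on simulated log-normal data (standing in for the real revenue column):

```r
set.seed(42)
# Simulated revenue with a heavy right skew (stand-in for transactionRevenue)
revenue <- rlnorm(10000, meanlog = 4, sdlog = 1.5)

log_revenue <- log(revenue)

# Histogram of the log-transformed values: roughly bell-shaped around 4
hist(log_revenue, breaks = 50, main = "log(transactionRevenue)")

# Q-Q plot: points near the line suggest approximate normality
qqnorm(log_revenue)
qqline(log_revenue)
```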

The mean of the natural log of transaction revenue appears to be around 4, and the distribution shows a clean bell-shaped curve. We then draw its Q-Q plot and obtain a well-fitted Q-Q line. We can now treat the log-transformed data as coming from a normally distributed population, and employ linear regression models on it later.

3.2. Temporal pattern of Revenue

Now let’s take a look at daily revenue over the time period of the data set.

The daily revenue data are quite volatile, but there appears to be a regular pattern of highs and lows, which we will examine more closely. The smoothing line indicates that daily revenue, fluctuations aside, has remained fairly steady over the course of the year.

Now we look at revenue by hour of day.
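The hour-of-day aggregation can be sketched as follows (toy epoch timestamps; the report may use a local time zone, UTC is assumed here):

```r
# Toy sessions: POSIX start times (seconds) and revenue in dollars
visitStartTime <- c(1472830385, 1472833985, 1472880147)
revenue        <- c(20, 5, 10)

# Extract the hour of day, then total revenue per hour
hour <- as.integer(format(as.POSIXct(visitStartTime, origin = "1970-01-01",
                                     tz = "UTC"), "%H"))
revenue_by_hour <- tapply(revenue, hour, sum)
revenue_by_hour
```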

3.4. Revenue by device

Interestingly, although Windows accounted for more records in the data set than any other operating system, Macintosh accounted for more transaction revenue than any other operating system by a large margin.

Let’s attempt to determine if there is a difference in transaction revenue between mobile and non-mobile devices:

There seems to be a smaller mean transaction revenue for mobile devices than for non-mobile devices, although we would want to perform some statistical testing to determine whether this effect is statistically significant or just occurring by random chance in this particular sample.
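One way to run that test, sketched on simulated data (a Welch two-sample t-test on log revenue, assuming an isMobile flag as in the parsed device column):

```r
set.seed(1)
# Simulated log revenue (stand-ins): mobile slightly lower on average
log_rev_mobile  <- rnorm(500, mean = 3.6, sd = 1)
log_rev_desktop <- rnorm(500, mean = 4.0, sd = 1)

# Welch two-sample t-test: is the difference in means significant?
res <- t.test(log_rev_mobile, log_rev_desktop)
res$p.value   # a small p-value argues against "just random chance"
```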

3.5. Revenue by geographic features

Revenue from the Americas dwarfs that of any other continent.

Next, let’s look at the distribution of total transaction revenue across countries. We’ll use the highcharter library to do this. We’ll use the countrycode package to convert the country names in the training data set to iso3 codes, which we can then use to join with the worldgeojson data from highcharter.

  • Note that this does not render directly in your R Markdown; some extra work is needed to generate the HTML highchart.

Note that in the maps below, I am using the log of total transaction revenue rather than raw transaction revenue so that we get better dispersion for the choropleth palette.
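The country-level map could be sketched like this (an untested sketch, assuming the dplyr, countrycode and highcharter packages; the toy train table and its column names stand in for the real session data):

```r
library(dplyr)
library(countrycode)
library(highcharter)

# Toy stand-in for the session table
train <- data.frame(country = c("United States", "Turkey", "United States"),
                    transactionRevenue = c(100, 20, 50))

# Total (log) revenue per country, keyed by iso3 code
by_country <- train %>%
  group_by(country) %>%
  summarise(revenue = sum(transactionRevenue, na.rm = TRUE)) %>%
  mutate(iso3        = countrycode(country, origin = "country.name",
                                   destination = "iso3c"),
         log_revenue = log1p(revenue))

# Choropleth joined to highcharter's worldgeojson on the iso3 field
highchart() %>%
  hc_add_series_map(worldgeojson, df = by_country,
                    value = "log_revenue", joinBy = "iso3") %>%
  hc_title(text = "Log of total transaction revenue by country")
```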

We can also look at individual continents. Since we will want to plot data for multiple continents, we will write a function to do so.

4. Correlation analysis

Correlation analysis is a statistical method used to evaluate the strength of the relationship between two quantitative variables, so we first select the numeric features from our data.

Feature Name:
fullVisitorId
visitId
visitNumber
hits
pageviews
bounces
newVisits
transactionRevenue

We know that fullVisitorId and visitId are identifiers rather than ‘real’ features, so they should be dropped. Besides, newVisits and bounces each take only a single non-missing value (NA or 1), so they would behave as binary categorical features after one-hot encoding; thus we remove them as well.

We then calculate their correlation matrix using Pearson’s method.
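In base R this is a single call to cor(), sketched here on toy columns (pairwise-complete observations guard against the many NA revenues in the real data):

```r
set.seed(7)
# Toy numeric columns standing in for visitNumber, hits, pageviews, revenue
num <- data.frame(
  visitNumber        = rpois(100, 2) + 1,
  hits               = rpois(100, 10),
  transactionRevenue = runif(100)
)
num$pageviews <- num$hits + rpois(100, 1)   # strongly tied to hits

# Pairwise-complete Pearson correlations
cmat <- cor(num[, c("visitNumber", "hits", "pageviews", "transactionRevenue")],
            use = "pairwise.complete.obs", method = "pearson")
round(cmat, 3)
```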

Correlation Matrix of numeric features and revenue

                    visitNumber   hits  pageviews  transactionRevenue
visitNumber               1.000  0.041      0.043               0.309
hits                      0.041  1.000      0.983               0.142
pageviews                 0.043  0.983      1.000               0.129
transactionRevenue        0.309  0.142      0.129               1.000

4.1. Pageviews and Hits

Pageviews and hits both capture user behavior while browsing the site, so they might be correlated with transaction revenue. Below are bivariate distribution plots of pageviews versus transaction revenue and of hits versus transaction revenue.

They do seem to have a positive correlation, although from the cloud of points alone it is hard to tell how strong the relationship between hits and revenue, or between pageviews and revenue, really is. As the next step, we fit a linear model to the data; in both cases the fit indicates some positive correlation.
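That linear fit can be sketched as follows (simulated stand-ins for pageviews and log revenue; on the real data one would regress log(transactionRevenue) on pageviews):

```r
set.seed(3)
# Simulated pageviews and a log revenue with a weak positive link
pageviews   <- rpois(1000, 5) + 1
log_revenue <- 0.1 * pageviews + rnorm(1000, mean = 3, sd = 1)

# Simple linear regression of log revenue on pageviews
fit <- lm(log_revenue ~ pageviews)
coef(summary(fit))["pageviews", ]   # positive slope, small p-value
```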

As further work, we will employ feature engineering techniques such as bucketing, missing data imputation, feature selection and feature construction. We will also try statistical regression methods such as the lasso and elastic net, as well as other machine learning methods.